Lance - work4ai

Lance

https://lance-project.github.io/assets/text-to-video/videos/t2v-fg-sp-0000-opt.mp4

A 3B-parameter unified multimodal model

Handles understanding, generation, and editing of text, images, and video in a single model

https://gyazo.com/91849ab21268b071816aac4d59b26785

It is a unified model, but understanding and generation do not share one token space:

Understanding uses ViT semantic tokens + next-token prediction

Generation uses VAE latent tokens + velocity prediction
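
As a rough illustration of the two objectives named above, here is a minimal NumPy sketch. It is not Lance's training code. The function names are hypothetical, and the linear noising path `x_t = (1 - t) * clean + t * noise` is an assumption (a common flow-matching convention); this note does not state which path or loss weighting the model uses.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Cross-entropy for next-token prediction (understanding branch).

    logits: (T, V) scores over the vocabulary, targets: (T,) token ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def velocity_target(clean_latent, noise, t):
    """Noisy latent and velocity target (generation branch).

    ASSUMPTION: linear path x_t = (1 - t) * clean + t * noise,
    so the velocity is d x_t / dt = noise - clean.
    """
    x_t = (1.0 - t) * clean_latent + t * noise
    v = noise - clean_latent
    return x_t, v

def velocity_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_v - target_v) ** 2))

# Usage with random stand-ins for model outputs and VAE latents.
rng = np.random.default_rng(0)
clean = rng.normal(size=(16, 8))   # 16 latent tokens, 8 channels (made-up sizes)
noise = rng.normal(size=(16, 8))
x_t, v = velocity_target(clean, noise, t=0.3)
print(velocity_loss(np.zeros_like(v), v))
```

Under this assumed convention, `t = 0` gives the clean latent and `t = 1` gives pure noise. Other formulations flip the direction of time, which changes only the sign of the target.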

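It uses an extended RoPE that distinguishes ViT-derived semantic tokens, clean VAE latent tokens used for conditioning, and the noisy VAE latent tokens being generated, by both position and role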
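
One way to picture a RoPE that separates tokens by position and by role is to reserve a few channels for a role index and rotate the rest by position. The sketch below is a guess at the general idea, not Lance's actual scheme. The role ids, the channel split `role_dim`, and every function name are hypothetical.

```python
import numpy as np

# Hypothetical role ids for the three token kinds named in this note.
ROLE_IDS = {"vit_semantic": 0, "clean_vae": 1, "noisy_vae": 2}

def rope_angles(index, dim, base=10000.0):
    """Standard RoPE angles: (T,) indices -> (T, dim / 2) rotation angles."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(index, inv_freq)

def rotate(x, angles):
    """Rotate consecutive channel pairs of x (T, dim) by angles (T, dim / 2)."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def role_aware_rope(x, positions, roles, role_dim=4):
    """ASSUMED design: the last `role_dim` channels are rotated by role id,
    the remaining channels by position. Both channel counts must be even."""
    x = np.asarray(x, dtype=np.float64)
    pos_dim = x.shape[1] - role_dim
    pos_idx = np.asarray(positions, dtype=np.float64)
    role_idx = np.array([ROLE_IDS[r] for r in roles], dtype=np.float64)
    out = np.empty_like(x)
    out[:, :pos_dim] = rotate(x[:, :pos_dim], rope_angles(pos_idx, pos_dim))
    out[:, pos_dim:] = rotate(x[:, pos_dim:], rope_angles(role_idx, role_dim))
    return out

# Three tokens at the same position but with different roles.
x = np.ones((3, 8))
out = role_aware_rope(x, [0, 0, 0], ["vit_semantic", "clean_vae", "noisy_vae"])
print(out.round(3))
```

The point of the split is that a clean conditioning latent and a noisy target latent can sit at the same spatial or temporal position (for example, when editing) and attention can still tell them apart, because their role channels are rotated differently.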